[Paper-study] Batch Normalization

field : core
understanding : 😃😃😃

Paper study
Author

hoyeon

Published

March 21, 2023

Introduction

  • In deep learning, the distribution of each layer's inputs changes continuously during training.
  • This makes the network difficult to train.
  • The paper proposes a method that normalizes, shifts, and scales the inputs at the batch level, keeping their distribution roughly stable.
  • Applying this to the state-of-the-art model of the time reached the same accuracy with 14 times fewer training steps and surpassed the original model by a significant margin.
  • An ensemble of such networks also achieved the best result on ImageNet classification at the time (4.9% top-5 validation error, 4.8% test error).

Problem setting

  • The parameters of a deep neural network change continuously during training.
  • In addition, the input arrives as a series of minibatches, and the data distribution of each minibatch is different.

Figure: internal covariate shift (source: Dongbin Na)

  • ์ด๋กœ์ธํ•ด์„œ ๊ฐ๊ฐ์˜ hidden layer์— ์ž…๋ ฅ๋˜๋Š” input data์˜ ๋ถ„ํฌ๊ฐ€ ํ•™์Šต๋‹จ๊ณ„์—์„œ ์Šคํ…๋งˆ๋‹ค ๋ณ€ํ™”ํ•˜๋Š” internal covariate shift๊ฐ€ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค.(์‹ค์ œ๋กœ ํžˆ๋“ ๋ ˆ์ด์–ด์˜ output์€ ๋‹ค์ฐจ์›์ด์ง€๋งŒ ๋น„์œ ์ ์œผ๋กœ 1์ฐจ์›์œผ๋กœ ํ‘œํ˜„ํ•œ ๊ทธ๋ฆผ์ž„.)
  • ํŠนํžˆ ์ด ํ˜„์ƒ์€ Deep nueral network์˜ ํŠน์„ฑ์ƒ ๊นŠ์ด ์œ„์น˜ํ•œ hidden layer์ผ์ˆ˜๋ก ์‹ฌํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” ๊นŠ์ด ์œ„์น˜ํ•œ hidden layer์ผ์ˆ˜๋ก ํŒŒ๋ผ๋ฏธํ„ฐ ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ๋ฒˆ ๋ฐ˜๋ณต๋˜์–ด ๋” ์‹ฌํ•œ ๋ณ€ํ™”๋ฅผ ๋งŒ๋“ค๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ์ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‘ ๊ฐ€์ง€์˜ ๋ฌธ์ œ์ ์„ ์ผ์œผํ‚ต๋‹ˆ๋‹ค.
    1. ํ•™์Šตparameter์˜ converge๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
      • input์˜ ๋ถ„ํฌ๊ฐ€ ์ ๋‹นํžˆ ๊ณ ์ •๋œ๋‹ค๋ฉด ๊ทธ์— ๋งž๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ ๋‹นํ•œ ๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
      • ๊ทธ๋Ÿฌ๋‚˜ internal covariance shift๊ฐ€ ์ผ์–ด๋‚œ๋‹ค๋ฉด ๊ณ„์†ํ•ด์„œ ์ƒˆ๋กœ์šด ๋ถ„ํฌ์— ๋Œ€ํ•ด ๋‹ค์‹œ ํ•™์Šตํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ˆ˜๋ ด์ด ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
      • ๋น„์œ ํ•˜์ž๋ฉด,๋งˆ์น˜ training set๊ณผ test set์˜ ๋ถ„ํฌ๊ฐ€ ๊ฐ™์œผ๋ฉด ํ•™์Šต์ด ์ž˜๋˜๊ณ  ์•ˆ๋˜๋ฉด ํ•™์Šต์ด ์•ˆ๋˜๋Š” ๊ฒƒ๊ณผ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค.(์ €๋Š” ์ž˜ ์™€๋‹ฟ์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.)
    2. Gradient exploding ๋˜๋Š” Gradient vanishing์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค..
      • ๋ถ„ํฌ์˜ ๋ณ€ํ™”๋กœ ์ธํ•ด ์–ด๋–ค hidden layer์—์„œ ์‹œ๊ทธ๋ชจ์ด๋“œ์˜ input์ด ๋„ˆ๋ฌด ํฌ๋‹ค๋ฉด ๊ธฐ์šธ๊ธฐ๊ฐ€ ๊ฑฐ์˜ ์—†์œผ๋ฉฐ ๋ฏธ๋ถ„๊ณ„์ˆ˜๊ฐ€ 0์— ๊ฐ€๊น๊ธฐ์— ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์—…๋ฐ์ดํŠธ๊ฐ€ ์ผ์–ด๋‚˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
Code
import matplotlib.pyplot as plt
import torch

# Plot the sigmoid and mark a saturated input (x = 10): the curve is almost
# flat there, so the derivative is ~0 and the gradient vanishes.
plt.figure(figsize=(10, 5))
sig = torch.nn.Sigmoid()
x = torch.linspace(-20, 20, 50)
z = sig(x)
point_x = torch.tensor(10.0)  # must be a float tensor; sigmoid is not defined for ints
point_z = sig(point_x)
plt.plot(x, z)
plt.scatter(point_x, point_z, s=80, color="red")
plt.axvline(point_x, color="black", linestyle="--")
plt.show()

  • ์œ„์™€ ๊ฐ™์€ ๋ฌธ์ œ์  ์ฆ‰,internal covariance shift๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํฌ๊ฒŒ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํฌ๊ฒŒ 2๊ฐ€์ง€์˜ ๋ฐฉ๋ฒ•์ด ์‹œ๋„๋˜์–ด์™”์Šต๋‹ˆ๋‹ค.
    1. lower learning rate๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์—ฐ๊ตฌ๋˜์–ด ์™”์Šต๋‹ˆ๋‹ค.
    2. careful parameter initialization.(HE,Xavior)
  • ๊ทธ๋Ÿฌ๋‚˜ ๊ฐ๊ฐ์˜ ๋ฐฉ๋ฒ•๋“ค์€ ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.(ํ•™์Šต์‹œ๊ฐ„์˜ ์ƒ์Šน,์ดˆ๊ธฐํ™”์˜ ์–ด๋ ค์›€)
  • ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” internal covariate shift๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ๊ฐ์˜ ๋ ˆ์ด์–ด์—์„œ Batch๋‹จ์œ„๋กœ Normalization์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
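As a quick illustration of the second approach, here is a minimal PyTorch sketch of the two initialization schemes (the layer sizes and names are my own, not from the paper):

import torch.nn as nn

fc_sigmoid = nn.Linear(256, 128)
nn.init.xavier_uniform_(fc_sigmoid.weight)  # Xavier/Glorot: keeps activation variance stable for sigmoid/tanh
fc_relu = nn.Linear(256, 128)
nn.init.kaiming_normal_(fc_relu.weight, nonlinearity="relu")  # He: the analogous scheme for ReLU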

Method

Intuition

Figure: Batch Normalization (source: JINSOL KIM)

  • ์ง๊ด€์ ์œผ๋กœ internal covariate shift๋ฅผ ๋ง‰๊ธฐ ์œ„ํ•ด ๋ถ„ํฌ๊ฐ€ ๊ณ ์ •๋˜๊ฒŒ ํ•˜๋ ค๋ฉด ์œ„์™€ ๊ฐ™์ด ๊ฐ ํžˆ๋“ ๋ ˆ์ด์–ด์˜ output์— normalization์„ ์ทจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋…ผ๋ฌธ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์„ค๋ช…ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋‚˜ ์ฐพ์•„๋ณธ ํ”ํžˆ Fully connected-layer์™€ activation function์‚ฌ์ด์— batchnormalization layer๋ฅผ ๋†“์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋…ผ๋ฌธ์˜ ์‹คํ—˜์—์„œ ์‚ฌ์šฉํ•œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ์ •๋ฆฌํ•˜์ž๋ฉด normalization์„ ์ ์šฉํ•˜๋Š” ์œ„์น˜๋Š” ๋ฌธ์ œ๋งˆ๋‹ค ๋‹ค๋ฅด์ง€๋งŒ ํ”ํžˆ๋“ค ์œ„์™€ ๊ฐ™์ด Fully connected layer์™€ activation function์‚ฌ์ด์— ๋†“๋Š”๊ฒŒ ์ผ๋ฐ˜์ ์ด๋ฉฐ ์ด๋Š” ๋น„๊ต์  ์ž์œ ๋กœ์šด ํŽธ์ด๋ผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์œ„์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ normalization๋งŒ ์ทจํ•˜๊ฒŒ ๋œ๋‹ค๋ฉด ๋„คํŠธ์›Œํฌ์˜ ํ‘œํ˜„๋ ฅ์„ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹œ๊ทธ๋ชจ์ด๋“œ์˜ linear regime์— ๊ฐ’๋“ค์ด ๋Œ€๋‹ค์ˆ˜ ์œ„์น˜ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.(๋‰ด๋Ÿด๋„ท์€ ์„ ํ˜•+๋น„์„ ํ˜• ๋ณ€ํ™˜์„ ํ†ตํ•ด์„œ ๋†’์€ ํ‘œํ˜„๋ ฅ์„ ์ง€๋‹™๋‹ˆ๋‹ค.๋‹จ์ˆœํžˆ normalization๋งŒ ์ทจํ•˜๋ฉด ๋น„์„ ํ˜•ํ•จ์ˆ˜์˜ ์—ญํ• ์ด ๊ฐ์†Œํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.)

Figure 3 - DNN with learnable parameters

  • ๋”ฐ๋ผ์„œ normalization๋œ ๊ฐ’์„ ์ ์ ˆํ•˜๊ฒŒ shifting,scailingํ•˜๋„๋ก ๊ฐ ๋‰ด๋Ÿฐ์— ๋ถ™๋Š” learnable parameter \(\gamma,\beta\)๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

์ •๋ฆฌํ•˜์ž๋ฉด BatchNormalization์€ Batch๋‹จ์œ„๋กœ normalization์„ ํ†ตํ•ด internal covariate shift๋ฅผ ๋ง‰๊ณ  ๋™์‹œ์— learnable parameter๋กœ shifting,scailingํ•จ์œผ๋กœ์„œ nonlinearity๋ฅผ ์œ ์ง€ํ•˜์—ฌ gradient vanishing(exploding),ํ•™์Šต์˜ ์–ด๋ ค์›€,ํ‘œํ˜„๋ ฅ์˜ ๊ฐ์†Œ์™€ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Implementation

Training

Figure 4 - BN Algorithm

notation

  • ๋…ผ๋ฌธ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ€๋ถ„์—์„œ๋Š” Batchnormalization์€ activation function๋ฐ”๋กœ ๋‹ค์Œ์— ์œ„์น˜ํ•˜๋Š” ๊ฒƒ์„ ๊ธฐ์ค€์œผ๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
  • \(\mathcal{B} = \{x_{1...m}\}\)๋Š” ํฌ๊ธฐ๊ฐ€ m์ธ batch๋ฅผ ์ž…๋ ฅํ–ˆ์„๋•Œ ์ž„์˜์˜ ๋…ธ๋“œ์—์„œ ์ถœ๋ ฅ๋œ m๊ฐœ์˜ scalar๊ฐ’์ด๋‹ค.(activation function์„ ํ†ต๊ณผํ•œ ํ›„์ด๋‹ค.)m๊ฐœ์˜ output์ž…๋‹ˆ๋‹ค.
  • \(\mu_{\mathcal{B}}\),\(\sigma^2_{\mathcal{B}}\)๋Š” ๊ฐ๊ฐ \(\mathcal{B}\)์˜ ํ‰๊ท ,๋ถ„์‚ฐ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • \(\hat{x_i}\)๋Š” \(\mathcal{B}\)์— ์†ํ•˜๋Š” ์ž„์˜์˜ ์›์†Œ \(x_i\)์— normalizationํ•œ ๊ฐ’์ž…๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ \(\epsilon\)์€ ๋งค์šฐ์ž‘์€ ๊ฐ’์„ ์˜๋ฏธํ•˜๋ฉฐ ๋ถ„์‚ฐ์ด 0์ผ๋•Œ์˜ ์—ฐ์‚ฐ์ด ๋ถˆ์•ˆ์ •ํ•ด์ง€๋Š” ๊ฒƒ์„ ๋ง‰์Šต๋‹ˆ๋‹ค.
  • \(y_i\)๋Š” learnable parameter์ธ \(\gamma,\beta\)์— ๋Œ€ํ•œ ๊ฐ’์ด๋ฉฐ \(\text{BN}_{\gamma,\beta}(x_i)\)๋ฅผ ๊ณ„์‚ฐํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.

explanation
  • For a batch of size m, a given node outputs the m scalar values \(\mathcal{B}\).
  • Compute the mean and variance of \(\mathcal{B}\).
  • For every \(x_i \in \mathcal{B}\), normalize it and apply the learnable parameters \(\gamma,\beta\); a from-scratch sketch follows this list.
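The following is a minimal from-scratch sketch of this procedure for a single node (the function and variable names are my own; PyTorch's nn.BatchNorm1d performs the same computation per feature):

import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: the m scalar outputs of one node over a mini-batch, shape (m,)
    mu = x.mean()                             # mini-batch mean
    var = x.var(unbiased=False)               # mini-batch variance (biased)
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta               # scale and shift: BN_{gamma,beta}(x_i)

x = torch.randn(8)                            # a mini-batch of m = 8 values
gamma = torch.ones(1, requires_grad=True)     # learnable scale
beta = torch.zeros(1, requires_grad=True)     # learnable shift
y = batch_norm_train(x, gamma, beta)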

ํ•™์Šต๋œ \(\gamma\),\(\beta\)์˜ ์˜ˆ์‹œ
Normalization์—ฐ์‚ฐ์ด ํ•„์š”์—†๋‹ค๊ณ  ํ•™์Šตํ•œ ๊ฒฝ์šฐ,nonlinearity๋ฅผ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ด ์ข‹์€ ๊ฒฝ์šฐ,identity๋ฅผ ์œ ์ง€ํ•˜๋Š”๊ฒŒ ์ข‹์€ ๊ฒฝ์šฐ
\[\gamma \approx \sqrt{var[x]},\beta \approx \mathbb{E}[x] \rightarrow \hat{x_i}\approx x_i \] Normalization์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๋‹ค๊ณ  ํ•™์Šตํ•œ ๊ฒฝ์šฐ,linearity๋ฅผ ๊ฐ€์ง€๋Š” ๊ฒƒ์ด ๊ฒฝ์šฐ,identity๋ฅผ ๋ฒ„๋ฆฌ๋Š”๊ฒŒ ์ข‹์€ ๊ฒฝ์šฐ
\[\gamma \approx 1,\beta \approx 0 \rightarrow \hat{x_i} \approx \frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2-\epsilon}}\]
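A quick numeric check of the first case (my own illustration, not from the paper): with \(\gamma=\sqrt{\text{Var}[x]}\) and \(\beta=\mathbb{E}[x]\), the BN output recovers the input almost exactly.

import torch

x = torch.randn(32) * 3 + 5               # a batch of 32 values with mean 5, std 3
mu, var = x.mean(), x.var(unbiased=False)
eps = 1e-5
x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize
gamma, beta = torch.sqrt(var), mu         # the "normalization not needed" case
y = gamma * x_hat + beta                  # scale and shift
print(torch.allclose(y, x, atol=1e-3))    # True: the identity is recovered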

Test or Inference

  • During training we can normalize using the mean and variance of each minibatch, but at test time the data does not arrive in minibatches, and we want correct predictions even when a single input is given.
  • Therefore, the means and variances obtained from each batch during training are stored, and at test time these values are averaged again (a mean of means) and used for normalization.
  • Rather than a plain average, a moving average or an exponential average is used, to give more weight to the minibatch statistics obtained after the network has already been trained to some degree.
  • The moving average computes the mean over only a designated portion of the values (means, variances) obtained during training, while the exponential average assigns larger weights to the later, more stable values.
\[\begin{aligned} &\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}}\\ &y = \frac{\gamma}{\sqrt{\text{Var}[x] + \epsilon}}\cdot x + \left(\beta - \frac{\gamma\,\mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}}\right)\\ &\text{where } \mathbb{E}[x] = \mathbb{E}_\mathcal{B}[\mu_\mathcal{B}],\quad \text{Var}[x] = \frac{m}{m-1}\,\mathbb{E}_\mathcal{B}[\sigma_\mathcal{B}^2] \end{aligned}\]
  • The factor \(\frac{m}{m-1}\) makes the variance an unbiased estimate, and \(\mathbb{E}_{\mathcal{B}}\) denotes the moving average or exponential average over the training minibatches.
  • Normalization at test time can therefore be treated as a simple linear transform: once training is finished, the scale and offset above can be computed in advance, so inference is just a multiply and an add, as sketched below.

Experiments


Figure 5 - MNIST experiment

  • (a) compares a network trained with BN against one trained without it; (b, c) trace three of the sigmoid inputs in a hidden layer of each network.
  • Each network has three hidden layers with 100 activations each.
  • (a) shows that the network with BN converges much faster.
  • (b, c) show that the values are far more stable in the network with BN (little internal covariate shift).

  • Bottom line: the model with BN simply performed better!

Reference